Introduction

The United States has one of the highest teenage birth rate compared to  other developed countries (Wind, 2015). The teenage birth rate can be defined as the number of live births per thousand females between the ages of 15-19, per year. From my personal experience in high school I remember a very rudimentary sexual education class stressing abstinence which in no way prepared me or any of my peers with sufficient knowledge to prevent pregnancy or Sexually Transmitted Diseases (STDs). My reflections on the United States sexual education policy in states led me to think about how and if there is a pattern in teenage birth rates by location.  This led me to develop the following research question: What is the geospatial relationship of the birth rate of teenagers in the United States? Examining this spatial relationship is crucial to see if there are clusters of counties that are correlated. I also want to overlay the sexual education policy by state to observe if it is a variable that warrants future research of a spatial regression.

Research Question

What is the geospatial relationship of the birth rate of teenagers in the United States?

library(terra) |> suppressMessages()
## Warning: package 'terra' was built under R version 4.2.3
library(tmap)|> suppressMessages()
library(stringr) |> suppressMessages()
## Warning: package 'stringr' was built under R version 4.2.3
library(spdep) # package for spatial dependence
## Warning: package 'spdep' was built under R version 4.2.3
## Loading required package: spData
## Warning: package 'spData' was built under R version 4.2.3
## Loading required package: sf
## Warning: package 'sf' was built under R version 4.2.3
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(sf)|> suppressMessages()
library(dplyr)|> suppressMessages()
## Warning: package 'dplyr' was built under R version 4.2.3
library(tidyverse)|> suppressMessages()
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
library(maps)|> suppressMessages()
## Warning: package 'maps' was built under R version 4.2.3
library(readxl)|> suppressMessages()
## Warning: package 'readxl' was built under R version 4.2.3
library(tigris)
## Warning: package 'tigris' was built under R version 4.2.3
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
## 
## Attaching package: 'tigris'
## The following object is masked from 'package:terra':
## 
##     blocks
library(ggplot2)
library(ggpattern)
## Warning: package 'ggpattern' was built under R version 4.2.3
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 4.2.3
options(tigris_use_cache = TRUE)

Method

I used data from the US census to get the US geospatial data. For teenage birth rates by county I used data from the National Center for Health Statistics and filtered it to 2016. I chose this year since it was also the year I was able to get data on sexual education policy by states from the Guttmacher Institute.

# Load state and county data

states <- tigris::states(cb = TRUE, class = "sf")  
## Retrieving data for the year 2021
counties <- tigris::counties(cb = TRUE, class = "sf")
## Retrieving data for the year 2022
# cb = TRUE for a low-resolution version for faster processing

# Visualize the map with state and county boundaries
all_sf <- bind_rows(
  mutate(states, type = "state"),
  mutate(counties, type = "county")
)
mapview::mapview(all_sf)
# Filter out US territories from dataset and Alaska and Hawaii
# List of US states I want to keep
us_states <- c("alabama", "arizona", "arkansas", "california", 
               "colorado", "connecticut", "delaware", "florida", "georgia", 
               "idaho", "illinois", "indiana", "iowa", "kansas", 
               "kentucky", "louisiana", "maine", "maryland", "massachusetts", 
               "michigan", "minnesota", "mississippi", "missouri", "montana", 
               "nebraska", "nevada", "new hampshire", "new jersey", "new mexico", 
               "new york", "north carolina", "north dakota", "ohio", 
               "oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", 
               "south dakota", "tennessee", "texas", "utah", "vermont", 
               "virginia", "washington", "west virginia", "wisconsin", "wyoming")
all_sf <- all_sf %>%
  filter(tolower(STATE_NAME) %in% us_states)
unique(all_sf$STATE_NAME)
##  [1] "Alabama"        "Arizona"        "Arkansas"       "California"    
##  [5] "Colorado"       "Delaware"       "Florida"        "Illinois"      
##  [9] "Georgia"        "Idaho"          "Michigan"       "Indiana"       
## [13] "Iowa"           "New York"       "Kansas"         "Kentucky"      
## [17] "Louisiana"      "Maine"          "Maryland"       "Massachusetts" 
## [21] "Minnesota"      "Mississippi"    "Missouri"       "Montana"       
## [25] "Nebraska"       "New Jersey"     "New Mexico"     "North Carolina"
## [29] "North Dakota"   "Ohio"           "Oklahoma"       "Oregon"        
## [33] "Pennsylvania"   "South Carolina" "South Dakota"   "Tennessee"     
## [37] "Texas"          "West Virginia"  "Utah"           "Vermont"       
## [41] "Virginia"       "Washington"     "Wisconsin"      "Wyoming"       
## [45] "Connecticut"    "Rhode Island"   "Nevada"         "New Hampshire"

I had to do some coding of my own in order to get the different state sexual education policies in one to create a comprehensive column. States have different policies regarding sexual education. Some do not mandate sex education at all and for those that do not mandate it, they still have directives about what needs to be included if the school decides to teach it anyways. I coded these as not having sexual education since it would be up to each county’s school board to decide whether or not to teach sexual education at all. Such a localized policy scope was not feasible for this project. Others have policies regarding whether abstinence, contraception, or STD content be included in a schools curriculum. However, I included solely abstinence and contraception policies in my sexual education policy code as I am not looking at STD rates among teenagers. The four categories I coded for are “No Sex Ed” (no state mandate to provide sexual education), “No Contraception/ Abstinence Pregnancy Education” (state mandates sexual education but does not require abstinence or contraception content), “Abstinence” (state mandates content about abstinence), and “Abstinence/Contraception” (state mandates content about abstinence and contraception).

#Load birth rate data from NCHS
birthrate <- read_excel("NCHS2016-Teen_Birth_Rates.xlsx")

#Load Sex Education data by State from Guttmacher Data Center
sexedu <-read_excel("GuttmacherDataCenter-SexEdbyState_treated.xlsx")
# Merge birthrate data with all_sf
all_sf <- all_sf %>%
  left_join(birthrate, by = c("NAME" = "County", "STATE_NAME" = "State"))

# Merge sex education data with all_sf
all_sf <- all_sf %>%
  left_join(sexedu, by = c("STATE_NAME" = "U.S. State"))
# Determine best # of groups to put separate birthrate data into
library(cluster)  # For clustering algorithms
## Warning: package 'cluster' was built under R version 4.2.3
## 
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
## 
##     votes.repub
print(head(birthrate))
## # A tibble: 6 × 9
##    Year State   County `State FIPS Code` `County FIPS Code` `Combined FIPS Code`
##   <dbl> <chr>   <chr>              <dbl>              <dbl>                <dbl>
## 1  2016 Alabama Autau…                 1                  1                 1001
## 2  2016 Alabama Baldw…                 1                  3                 1003
## 3  2016 Alabama Barbo…                 1                  5                 1005
## 4  2016 Alabama Bibb                   1                  7                 1007
## 5  2016 Alabama Blount                 1                  9                 1009
## 6  2016 Alabama Bullo…                 1                 11                 1011
## # ℹ 3 more variables: `Birth Rate` <dbl>, `Lower Confidence Limit` <dbl>,
## #   `Upper Confidence Limit` <dbl>
# Assuming 'birthrate' is your dataframe and 'Birth_Rate' is the column
data <- birthrate$`Birth Rate` %>% na.omit()  # Remove NA values for clustering
wss <- map_dbl(1:10, function(k) {
  kmeans(data, centers = k, nstart = 25)$tot.withinss
})

# Plotting the within-group sum of squares by number of clusters
qplot(1:10, wss, geom = "line") +
  labs(x = "Number of Clusters", y = "Total Within Sum of Squares", title = "Elbow Method for Optimal k")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#Clean dataset of NAs

# Need to remove NAs so I can run Spatial Autocorrelations/Moran's Test
all_sf_clean <- all_sf %>%
  filter(!is.na(`Birth Rate`))

##Create State Geometries

states_geom <- all_sf_clean %>%
  group_by(STATE_NAME) %>%
  summarise(geometry = st_union(geometry)) %>%
  ungroup()

# Check the result
print(states_geom)
## Simple feature collection with 47 features and 1 field
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -124.7631 ymin: 24.5213 xmax: -66.9499 ymax: 49.00249
## Geodetic CRS:  NAD83
## # A tibble: 47 × 2
##    STATE_NAME                                                           geometry
##    <chr>                                                          <GEOMETRY [°]>
##  1 Alabama    MULTIPOLYGON (((-88.05235 30.50559, -88.04504 30.50119, -88.02675…
##  2 Arizona    POLYGON ((-110.0007 36.99798, -110.0009 36.9985, -110.0218 36.998…
##  3 Arkansas   POLYGON ((-94.43084 35.39346, -94.43066 35.39248, -94.43146 35.39…
##  4 California MULTIPOLYGON (((-119.5854 38.71321, -119.5871 38.71435, -119.5877…
##  5 Colorado   POLYGON ((-109.0435 37.48482, -109.0458 37.37499, -109.0458 37.35…
##  6 Delaware   MULTIPOLYGON (((-75.56681 39.50919, -75.56545 39.50941, -75.56144…
##  7 Florida    MULTIPOLYGON (((-84.43628 29.97898, -84.43546 29.98175, -84.43617…
##  8 Georgia    MULTIPOLYGON (((-84.97196 32.3777, -84.97433 32.37458, -84.97727 …
##  9 Idaho      POLYGON ((-117.0398 46.81544, -117.0395 46.73673, -117.0395 46.73…
## 10 Illinois   POLYGON ((-91.51308 40.17854, -91.51196 40.17044, -91.50822 40.15…
## # ℹ 37 more rows

Mapping

I also created maps in order to visualize the average teenage birth rate by county and state and sexual education policy by state. Then I overlayed both of them in order to see if there were any obvious patterns to inform a hypothesis.

Map birth rate by county

# Set color palette I want to use for the map
custom_palette <- colorRampPalette(c("blue", "pink", "red"))(100)  

tmap_mode("plot")
## tmap mode set to 'plot'
tm <- tm_shape(all_sf) +
  tm_polygons("Birth Rate", id = "NAME",
              palette = custom_palette, title = "Teen Birth Rate by County",
              style = "quantile", n = 3,
              border.col = "black", border.alpha = 0.5) +
  tm_shape(states_geom) +  # Add the shape for the states
  tm_borders(lwd = 1.5, col = "black") +  # Draw state borders
  tm_layout(
    frame = FALSE,
    main.title = "Average Teen Birth Rate by County",
    font.size = 10,  # Adjust base font size
    outer.margins = c(0, 0.02, 0.02, 0.02),  # Reduce outer margins
    legend.position = c("right", "bottom"),
    legend.bg.color = "white",  # Adding background color to legend for better readability
    legend.bg.alpha = 0.7,  # Semi-transparent background for legend
    legend.text.size = 0.8,  # Adjust legend text size
    legend.title.size = 0.9,  # Adjust legend title size
    tm_title("Teen Birth Rate by County")  # Corrected to be inside tm_layout for tmap v3+
  ) +
  tm_scale_bar(
    position = c("left", "bottom"),
    breaks = c(0, 100, 200, 300, 400, 500)
  )
## tm_polygons: Deprecated tmap v3 code detected. Code translated to v4
## Warning: The 'main.title' argument of 'tm_layout()' is deprecated as of tmap
## 4.0. Please use 'tm_title()' instead.
## As of version 4.0, 'tm_scale_bar()' is deprecated. Please use 'tm_scalebar()' instead.FALSE
print(tm)
## Warning: Some legend items or map compoments do not fit well (e.g. due to the
## specified font size).
## Warning: Some legend items or map compoments do not fit well (e.g. due to the
## specified font size).

Average Teen Birth Rate by State

# Aggregate birth rate data to the state level by calculating the mean birth rate per state
state_birthrate <- all_sf %>%
  group_by(STATE_NAME) %>%
  summarise(Avg_Birth_Rate = mean(`Birth Rate`, na.rm = TRUE), .groups = 'drop')

state_birthrate <- as.data.frame(state_birthrate)

# Join
states <- left_join(states, sexedu, by = c("NAME" = "U.S. State"))
states <- left_join(states, state_birthrate, by = c("NAME" = "STATE_NAME")) 
#Filter out States with NA in Birth Rate Column
states <- states %>%
 filter(!is.na(Avg_Birth_Rate))
# Map
tm_states_birthrate <- tm_shape(states) +
  tm_polygons("Avg_Birth_Rate", id = "NAME",
              palette = colorRampPalette(c("blue", "pink", "red"))(100), 
              title = "Average Teen Birth Rate by State",
              style = "quantile", n = 3,
              border.col = "black", border.alpha = 0.5) +
  tm_layout(
    main.title = "Average Birth Rate by State",
    main.title.position = "center",
    legend.text.size = 0.6,
    legend.title.size = 0.7) +
  tm_scale_bar(
    position = c("left", "bottom"),
    breaks = c(0, 100, 200, 300, 400, 500)
  )
## tm_polygons: Deprecated tmap v3 code detected. Code translated to v4
## Warning: The 'main.title' argument of 'tm_layout()' is deprecated as of tmap
## 4.0. Please use 'tm_title()' instead.
## As of version 4.0, 'tm_scale_bar()' is deprecated. Please use 'tm_scalebar()' instead.FALSE
print(tm_states_birthrate)

Sex Education Level by State

# Map
tm_map <- tm_shape(states) +
  tm_polygons("SexEdu",
              title = "Sex Education Level",
              palette = colorRampPalette(c("lightblue", "blue", "darkblue","red"))(4),  # Ensure the palette function is called correctly
              id = "NAME") +  # Close parentheses correctly here
  tm_layout(
    main.title = "Sex Education Policy by State",
    main.title.position = "center",
    legend.text.size = 0.6,
    legend.title.size = 0.7)
## tm_polygons: Deprecated tmap v3 code detected. Code translated to v4
## Warning: The 'main.title' argument of 'tm_layout()' is deprecated as of tmap
## 4.0. Please use 'tm_title()' instead.
print(tm_map)

Spatial Autocorrelation Moran’s I Test

I use spatial autocorrelation to examine if neighboring counties have similar teenage birth rates. I ran the global Moran’s I test for different criteria to see which would give me the highest statistic meaning that there is a positive correlation and that areas are surrounded by those of similar values. I then used this statistic to determine which criteria would be the best to use for local Moran’s and see specifically how and which districts cluster together in a significant way.

Plotting Neighbors

#See neighbors
nb<-poly2nb(all_sf_clean,queen=TRUE)
nb
## Neighbour list object:
## Number of regions: 3008 
## Number of nonzero links: 17320 
## Percentage nonzero weights: 0.1914222 
## Average number of links: 5.757979 
## 4 regions with no links:
## 713 1073 1394 1577
## 6 disjoint connected subgraphs

Computing Spatial Autocorrelation: Global Moran’s I

Neighborhood based on contiguity (row standardized weights)

# Create Binary style weights for neighbors 
nb_clean <- poly2nb(all_sf_clean)

nbw <- nb2listw(nb_clean, style = "W", zero.policy = T)

# Hypothesis set to "greater", meaning I expect a positive autocorrelation
gmoran <- moran.test(all_sf_clean$`Birth Rate`, nbw,
                     alternative = "greater")
gmoran
## 
##  Moran I test under randomisation
## 
## data:  all_sf_clean$`Birth Rate`  
## weights: nbw  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 48.27, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5309940752     -0.0003330003      0.0001211634
moran.plot(all_sf_clean$`Birth Rate`, nbw, labels=F)

Neighborhood based on contiguity (inverse distance weights)

centroid_coords <- st_centroid(all_sf_clean)
## Warning: st_centroid assumes attributes are constant over geometries

invert distances such that closer areas have higher values

# compute distance for all spatial neighbour links
dists <- nbdists(nb, centroid_coords)

# Invert distances, handle zero by replacing with NA or a large number
inverted_dists <- lapply(dists, function(x) ifelse(x == 0, NA, 1/x))

# Clean NAs and infinite values in inverted distances
inverted_dists <- lapply(inverted_dists, function(x) {
  x[is.na(x) | is.infinite(x)] <- 0  # Replace NA or infinite with 0 or some small number
  return(x)
})

# Use style=B to maintain the weights as set with glist
nbb <- nb2listw(nb, glist = inverted_dists, style = "B", zero.policy = TRUE)
gmoran_inverted <- moran.test(all_sf_clean$`Birth Rate`, nbb,
                     alternative = "greater")
gmoran_inverted
## 
##  Moran I test under randomisation
## 
## data:  all_sf_clean$`Birth Rate`  
## weights: nbb  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 34.835, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5804905913     -0.0003330003      0.0002780115
moran.plot(all_sf_clean$`Birth Rate`, nbb, labels=F)

Neighbors within Distance of 75km

#Check to see if coordinates are planar or geographical: Results are geographical
crs<-st_crs(all_sf_clean)
crs
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]
# Neighbors based on distance
# d1: lower distance bound 
# d2: upper distance bound in the metric of the points if planar coordinates, in km if in geographical coordinates
nb75 <- dnearneigh(x = centroid_coords, d1 = 0, d2 = 75)
        
nbw75<- nb2listw(nb75, style = "W", zero.policy = T)

#hypothesis set to "greater", meaning I expect a positive autocorrelation
gmoran75 <- moran.test(all_sf_clean$`Birth Rate`, nbw75,
                     alternative = "greater")
gmoran75
## 
##  Moran I test under randomisation
## 
## data:  all_sf_clean$`Birth Rate`  
## weights: nbw75  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 48.779, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5053910671     -0.0003459011      0.0001074929

5 nearest neighbors (row standardised weights)

nb5 <- knn2nb(knearneigh(centroid_coords, k = 5)) 
plot(st_geometry(all_sf_clean), border = "lightgray")
plot.nb(nb5, st_geometry(all_sf_clean), add = TRUE, arrows=T)
## Warning in st_point_on_surface.sfc(coords): st_point_on_surface may not give
## correct results for longitude/latitude data

nbw5<- nb2listw(nb5, style = "W", zero.policy = T)

#hypothesis set to "greater", meaning I expect a positive autocorrelation
gmoran1 <- moran.test(all_sf_clean$`Birth Rate`, nbw5,
                     alternative = "greater")
gmoran1
## 
##  Moran I test under randomisation
## 
## data:  all_sf_clean$`Birth Rate`  
## weights: nbw5    
## 
## Moran I statistic standard deviate = 48.503, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5399484531     -0.0003325574      0.0001240814
moran.plot(all_sf_clean$`Birth Rate`, nbw5, labels=F)

Local Moran’s with 5 nearest neighbors

lmoran <- localmoran(all_sf_clean$`Birth Rate`, nbw5, alternative = "two.sided")
head(lmoran)
##           Ii          E.Ii      Var.Ii       Z.Ii Pr(z != E(Ii))
## 1 0.41637243 -8.945163e-05 0.053737685 1.79653541     0.07240941
## 2 0.16105573 -1.179860e-05 0.007088511 1.91306837     0.05573930
## 3 0.75734012 -2.618316e-04 0.157267092 1.91039052     0.05608295
## 4 0.29553932 -7.669452e-05 0.046074495 1.37720207     0.16844978
## 5 0.13353838 -4.065801e-05 0.024426321 0.85469129     0.39272207
## 6 0.03793513 -6.536127e-04 0.392433505 0.06159956     0.95088173
tmap_mode("plot")
## tmap mode set to 'plot'
# local Moran's I 
all_sf_clean$lmI <- lmoran[, "Ii"] 
# p-values corresponding to alternative greater
all_sf_clean$lmp <- lmoran[, "Pr(z != E(Ii))"]
all_sf_clean$lmI_sign <- all_sf_clean$lmI
# Handling NA values
all_sf_clean$lmI_sign[all_sf_clean$lmp >= 0.05 & !is.na(all_sf_clean$lmp)] <- NA

# Assign 'non-significant' to 'quadr' where p-values are not significant or are NA
all_sf_clean$quadr[all_sf_clean$lmp >= 0.05 & !is.na(all_sf_clean$lmp)] <- "non-significant"


# get quadrant information
all_sf_clean$quadr <- attributes(lmoran)$quadr$mean
levels(all_sf_clean$quadr) <- c(levels(all_sf_clean$quadr), "non-significant")
all_sf_clean[(all_sf_clean$lmp >= 0.05) & !is.na(all_sf_clean$lmp), "quadr"] <- "non-significant"

#plot
tm_shape(all_sf_clean) +
  tm_polygons("quadr", 
              palette = c("blue", "lightpink", "skyblue2", "red", "white"), 
              lwd=0.1, alpha = 0.7) +
tm_shape(states_geom) +
  tm_borders(lwd = 1.5, col = "black") + # Adding state borders
  tm_layout(main.title = "Local Moran's I Clusters by County",
          main.title.size = 0.8,
          legend.title.size = 0.7,
          legend.text.size = 0.6)
## tm_polygons: Deprecated tmap v3 code detected. Code translated to v4
## Warning: The 'main.title' argument of 'tm_layout()' is deprecated as of tmap
## 4.0. Please use 'tm_title()' instead.

Table of Counties in each group

table(all_sf_clean$quadr)
## 
##         Low-Low        High-Low        Low-High       High-High non-significant 
##             364              21              39             396            2188

Compute Local Moran’s with highest Moran’s value (Inverted Distance)

lmoran <- localmoran(all_sf_clean$`Birth Rate`, nbb, alternative = "two.sided")
head(lmoran)
##            Ii          E.Ii       Var.Ii      Z.Ii Pr(z != E(Ii))
## 1 0.058637457 -1.222702e-05 8.424578e-04 2.0206524     0.04331576
## 2 0.015460107 -1.500238e-06 8.417534e-05 1.6852417     0.09194195
## 3 0.116428945 -4.541539e-05 3.030178e-03 2.1159054     0.03435286
## 4 0.035524414 -1.023337e-05 6.058169e-04 1.4437144     0.14881932
## 5 0.016654504 -4.791399e-06 4.491266e-04 0.7860899     0.43181482
## 6 0.004505557 -2.825977e-05 5.589420e-04 0.1917699     0.84792241
# local Moran's I
all_sf_clean$lmI <- lmoran[, "Ii"] 
# p-values corresponding to alternative greater
all_sf_clean$lmp <- lmoran[, "Pr(z != E(Ii))"]
all_sf_clean$lmI_sign <- all_sf_clean$lmI
# Handle NA values explicitly in your subsetting
all_sf_clean$lmI_sign[all_sf_clean$lmp >= 0.05 & !is.na(all_sf_clean$lmp)] <- NA

# Assign 'non-significant' to 'quadr' where p-values are not significant or are NA
all_sf_clean$quadr[all_sf_clean$lmp >= 0.05 & !is.na(all_sf_clean$lmp)] <- "non-significant"


# get quadrant information
all_sf_clean$quadr <- attributes(lmoran)$quadr$mean
levels(all_sf_clean$quadr) <- c(levels(all_sf_clean$quadr), "non-significant")
all_sf_clean[(all_sf_clean$lmp >= 0.05) & !is.na(all_sf_clean$lmp), "quadr"] <- "non-significant"

#plot
tm_shape(all_sf_clean) +
  tm_polygons("quadr", 
              palette = c("blue", "lightpink", "skyblue2", "red", "white"), 
              lwd=0.1, alpha = 0.7) +
tm_shape(states_geom) +
  tm_borders(lwd = 1.5, col = "black") + # Adding state borders
  tm_layout(main.title = "Local Moran's I Clusters by County",
          main.title.size = 0.8,
          legend.title.size = 0.7,
          legend.text.size = 0.6)
## tm_polygons: Deprecated tmap v3 code detected. Code translated to v4
## Warning: The 'main.title' argument of 'tm_layout()' is deprecated as of tmap
## 4.0. Please use 'tm_title()' instead.

Table of Counties in each group

table(all_sf_clean$quadr)
## 
##         Low-Low        High-Low        Low-High       High-High non-significant 
##             320             301              85             147            2155

Local Moran’s- Neighborhood based on contiguity (row standardized weights):

lmoran <- localmoran(all_sf_clean$`Birth Rate`, nbw, alternative = "two.sided")
head(lmoran)
##             Ii          E.Ii      Var.Ii         Z.Ii Pr(z != E(Ii))
## 1  0.444257284 -8.945163e-05 0.044766487  2.100126664      0.0357177
## 2  0.115669325 -1.179860e-05 0.005059849  1.626273997      0.1038914
## 3  0.631426794 -2.618316e-04 0.098193706  2.015864006      0.0438142
## 4  0.252751872 -7.669452e-05 0.032888428  1.394134209      0.1632771
## 5  0.155829267 -4.065801e-05 0.030543072  0.891879072      0.3724578
## 6 -0.005137012 -6.536127e-04 0.280122898 -0.008470968      0.9932412

###Plot Morans I Clusters

## tmap mode set to interactive viewing
all_sf_clean$lmI <- lmoran[, "Ii"] # local Moran's I
# p-values corresponding to alternative greater
all_sf_clean$lmp <- lmoran[, "Pr(z != E(Ii))"]
all_sf_clean$lmI_sign <- all_sf_clean$lmI
# Handle NA values explicitly in your subsetting
all_sf_clean$lmI_sign[all_sf_clean$lmp >= 0.05 & !is.na(all_sf_clean$lmp)] <- NA

# Assign 'non-significant' to 'quadr' where p-values are not significant or are NA
all_sf_clean$quadr[all_sf_clean$lmp >= 0.05 & !is.na(all_sf_clean$lmp)] <- "non-significant"


# get quadrant information
all_sf_clean$quadr <- attributes(lmoran)$quadr$mean
levels(all_sf_clean$quadr) <- c(levels(all_sf_clean$quadr), "non-significant")
all_sf_clean[(all_sf_clean$lmp >= 0.05) & !is.na(all_sf_clean$lmp), "quadr"] <- "non-significant"

#plot
tm_shape(all_sf_clean) +
  tm_polygons("quadr", 
              palette = c("blue", "lightpink", "skyblue2", "red", "white"), 
              lwd=0.1, alpha = 0.7) +
tm_shape(states_geom) +
  tm_borders(lwd = 1.5, col = "black") + # Adding state borders
  tm_layout(main.title = "Local Moran's I Clusters by County",
          main.title.size = 0.8,
          legend.title.size = 0.7,
          legend.text.size = 0.6)
## tm_polygons: Deprecated tmap v3 code detected. Code translated to v4
## Warning: The 'main.title' argument of 'tm_layout()' is deprecated as of tmap
## 4.0. Please use 'tm_title()' instead.

table(all_sf_clean$quadr)
## 
##         Low-Low        High-Low        Low-High       High-High non-significant 
##             373              24              48             430            2133

Choosing which to use

Running the local moran’s with the contiguity-based row standardized weights calculated for Neighborhoods provided me with the highest number of significant neighborhoods in clusters. The global moran’s statistic result was similar to that of the 5 nearest neighbors and I wanted to see if broader spatial trends might appear beyond just including the 5 nearest neighbors.

Results

I got the following results: - neighbors using row standardized weights: 0.5309940752 - neighbors using inverse distance weights: 0.5804905913 - neighbors within a distance of 75km: 0.5053910671 - 5 nearest neighbors using row standardised weights: 0.5399484531.
The highest value attained by running the global Moran’s I test for the inverse distance weights, which discusses the effect of distance and implies that as distance increases the relationship decreases between counties. However, I am more interested in the relationship of average birth rates of neighboring counties since I want to observe where clusters are happening, not necessarily look at relationship distance plays. By observing this I will be able to hypothesize what variables would be worth including in a future spatial regression since I will better understand the significant spatial distribution of teenage birth rates.

I ran the impact of the local Moran’s I test for the neighborhood based on contiguity with row standardized weights which had the third largest score global Moran’s statistic and the highest number of counties of significance in clusters. The positive statistic suggests that the counties are surrounded by counties with similar average teenage birth rates. I got the following clusters: Low-Low (373), High-Low (24), Low-High (48), High-High (430), non-significant (2133).

Using Mapping to Explore whether State Sex. Education Policy is a Variable of Interest

Average Teenage Birth Rate and Sex Education Level by State

states$SexEdu <- factor(states$SexEdu, levels = c("No Sex Ed", "No Contraception/Abstinence Pregnancy Education","Abstinence", "Abstinence/Contraception"))
states$Rate_Group <- cut(states$Avg_Birth_Rate, breaks = 3, labels = c("Low", "Medium", "High"))
ggplot(data = states) +
  # Layer for color based on birth rate
  geom_sf(aes(fill = Rate_Group), color = "white", size = 0.2) +
  # Layer for patterns based on sex education (need a transparent fill to keep birth rate colors)
  geom_sf_pattern(aes(pattern = SexEdu),
                  fill = NA,  # No fill color, just the pattern
                  pattern_color = "black",
                  pattern_density = 0.1,
                  pattern_spacing = 0.02,
                  color = NA) +  # No border color for patterns
  scale_fill_manual(values = c("Low" = "blue", "Medium" = "pink", "High" = "red"),
                    name = "Average Birth Rate") +
  scale_pattern_manual(values = c("No Sex Ed" = "stripe",
                                  "No Contraception/Abstinence Pregnancy Education" = "crosshatch",
                                  "Abstinence" = "circle",
                                  "Abstinence/Contraception" = "none"),
                       name = "Sex Education Level") +
  labs(title = "Teenage Birth Rates and Sex Ed. Policy by State") +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5),
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.title = element_text(size = 10), # Smaller legend title
    legend.key.size = unit(0.5, "cm"),
    panel.grid.major = element_blank(),  # Remove major gridlines
    panel.grid.minor = element_blank(),  # Remove minor gridlines
    panel.border = element_rect(colour = "black", fill=NA, size=1),
    axis.text.x = element_blank(),  # Remove x axis labels
    axis.text.y = element_blank(),  # Remove y axis labels
    axis.ticks = element_blank()  # Remove axis ticks
  )
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Conclusion

There are a significant quantity of clusters of high-high and low-low scores with the local Moran’s I test for spatial autocorrelation. I hypothesized that this might be due to the sex education policy of states, however they are only present in certain parts of the United States and do not correspond to any visible pattern on the map I created between average teenage birth rate and sexual education policy by state. This causes me to question if sexual education policy by state is a variable affecting the average teenage birth rate, since there is so much variance by state. The spatial autocorrelation only happens in specific regions, which could be explained by a variety of different factors and would need to be studied in a much more localized manner. Factors could include counties’ school board policies regarding sexual education, rural versus urban, access to contraception, religiosity, culture (north versus south), political ideology, etc. In addition, more knowledge about which of these factors play significant role could help guide future policy to help reduce teenage birth rates in these regions. I would need to do regressions with more data to investigate further.

References

Bivand, R., & Piras, G. (2015). Comparing Implementations of Estimation Methods for Spatial Econometrics. Journal of Statistical Software, 63(18). https://doi.org/10.18637/jss.v063.i18

Create Elegant Data Visualisations Using the Grammar of Graphics. (n.d.). Retrieved May 6, 2024, from https://ggplot2.tidyverse.org/

Deckmyn, O. S. code by R. A. B. and A. R. W. R. version by R. B. E. by T. P. M. and A. (2023). maps: Draw Geographical Maps (3.4.2) [Computer software]. https://cran.r-project.org/web/packages/maps/index.html

FC, M., Davis, T. L., & authors, ggplot2. (2022). ggpattern: “ggplot2” Pattern Geoms (1.0.1) [Computer software]. https://cran.r-project.org/web/packages/ggpattern/index.html

Guttmacher Data Center. (n.d.). Retrieved May 1, 2024, from https://data.guttmacher.org/states/table?state=AL+AK+AZ+AR+CA+CO+CT+DE+DC+FL+GA+HI+ID+IL+IN+IA+KS+KY+LA+ME+MD+MA+MI+MN+MS+MO+MT+NE+NV+NH+NJ+NM+NY+NC+ND+OH+OK+OR+PA+RI+SC+SD+TN+TX+UT+VT+VA+WA+WV+WI+WY&topics=200+198+219+220&dataset=data

Hijmans, R. J., Bivand, R., Pebesma, E., & Sumner, M. D. (2024). terra: Spatial Data Analysis (1.7-71) [Computer software]. https://cran.r-project.org/web/packages/terra/index.html

Holpuch, A. (2016, September 28). US teenage birth rates fall again but still among highest in developed world. The Guardian. https://www.theguardian.com/us-news/2016/sep/28/us-teenage-birth-rates-fall-again

NCHS - Teen Birth Rates for Age Group 15-19 in the United States by County. (2022). [dataset]. Centers for Disease Control and Prevention. https://catalog.data.gov/dataset/nchs-teen-birth-rates-for-age-group-15-19-in-the-united-states-by-county

OpenAI. (2024). ChatGPT [Chatbot]. Retrieved May 6, 2024, from https://www.openai.com/chatgpt

Pebesma, E. (2018). Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal, 10(1), 439. https://doi.org/10.32614/RJ-2018-009

So, A. D., Ivette Ivanov, Jasmine. (2023, March 10). Variation and Effectiveness of Sexual Education in the U.S. ArcGIS StoryMaps. https://storymaps.arcgis.com/stories/04121761023649ad9c3f7c82612f8edb

State Policies on Sex Education in Schools. (n.d.). Retrieved May 1, 2024, from https://www.ncsl.org/health/state-policies-on-sex-education-in-schools

Tennekes, M. (2018). tmap: Thematic Maps in R. Journal of Statistical Software, 84(6). https://doi.org/10.18637/jss.v084.i06

The Differences Between Fertility Rate, Pregnancy Rate, and Live Birth – Fairhaven Health. (n.d.). Retrieved May 1, 2024, from https://www.fairhavenhealth.com/blogs/fairhaven-health-library/the-differences-between-fertility-rate-pregnancy-rate-and-live-birth-rate

Walker, K., Herman, M., & Eberwein, K. (2024). tidycensus: Load US Census Boundary and Attribute Data as “tidyverse” and ’sf’-Ready Data Frames (1.6.3) [Computer software]. https://cran.r-project.org/web/packages/tidycensus/index.html

Walker, K., & Rudis, B. (2024). tigris: Load Census TIGER/Line Shapefiles (2.1) [Computer software]. https://cran.r-project.org/web/packages/tigris/index.html

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T., Miller, E., Bache, S., Müller, K., Ooms, J., Robinson, D., Seidel, D., Spinu, V., … Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H., Bryan, J., Posit, attribution), P. (Copyright holder of all R. code and all C. code without explicit copyright, code), M. K. (Author of included R., code), K. V. (Author of included libxls, code), C. L. (Author of included libxls, code), B. C. (Author of included libxls, code), D. H. (Author of included libxls, & code), E. M. (Author of included libxls. (2023). readxl: Read Excel Files (1.4.3) [Computer software]. https://cran.r-project.org/web/packages/readxl/index.html

Wickham, H., François, R., Henry, L., Müller, K., Vaughan, D., Software, P., & PBC. (2023). dplyr: A Grammar of Data Manipulation (1.1.4) [Computer software]. https://cran.r-project.org/web/packages/dplyr/index.html

Wickham, H., Software, P., & PBC. (2023). stringr: Simple, Consistent Wrappers for Common String Operations (1.5.1) [Computer software]. https://cran.r-project.org/web/packages/stringr/index.html

Wind, R. (2015, January 23). Teen Pregnancy Rates Declined In Many Countries Between The Mid-1990s and 2011 | Guttmacher Institute. https://www.guttmacher.org/news-release/2015/teen-pregnancy-rates-declined-many-countries-between-mid-1990s-and-2011